Contributor

@geruh commented Jan 16, 2026

Related to #2255.

Rationale for this change

This PR is one piece of the existing DeleteFileIndex (DFI) PR, #2255. It replaces the existing delete-to-data-file matching logic with an index over delete files, so that the deletes relevant to a data file can be looked up efficiently.

The previous implementation:

  1. Scanned all delete files with sequence number >= data file's sequence number
  2. Created a new _InclusiveMetricsEvaluator instance for each data file
  3. Evaluated every candidate delete file against the data file's path
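For illustration, the three steps above can be sketched roughly as follows. This is a simplified stand-in, not the PyIceberg code: `DeleteFile`, `may_reference`, and `match_by_linear_scan` are hypothetical names, and the bounds check only approximates what the metrics evaluator does with the delete file's path bounds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeleteFile:
    # Hypothetical stand-in for a positional delete file entry.
    path: str
    sequence_number: int
    # Lower/upper bounds of the referenced data file path, if recorded.
    lower_path: Optional[str] = None
    upper_path: Optional[str] = None


def may_reference(delete: DeleteFile, data_path: str) -> bool:
    """Approximate the per-data-file metrics check (steps 2 and 3)."""
    if delete.lower_path is None or delete.upper_path is None:
        return True  # No bounds recorded: the delete file cannot be ruled out.
    return delete.lower_path <= data_path <= delete.upper_path


def match_by_linear_scan(data_path: str, data_seq: int, deletes: List[DeleteFile]) -> List[DeleteFile]:
    """Scan every delete file for one data file: O(number of delete files)."""
    return [
        d
        for d in deletes
        if d.sequence_number >= data_seq and may_reference(d, data_path)  # step 1, then step 3
    ]
```

Because this scan is repeated for every data file, the total cost grows with (data files × delete files), which is the cost the index is meant to avoid.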

Now we extend this workflow with a DeleteFileIndex that:

  • Indexes path-specific deletion vectors (DVs)
  • Indexes partition-scoped deletes by (spec_id, partition record)
  • Uses bisect_left for sequence number filtering
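A minimal sketch of the indexing idea in the list above, under stated assumptions: `SimpleDeleteFileIndex` and its methods are hypothetical and do not mirror the actual `DeleteFileIndex` API in this PR. Entries are kept sorted by sequence number so that `bisect_left` can find the first delete file whose sequence number is at or after the data file's.

```python
from bisect import bisect_left
from collections import defaultdict


class SimpleDeleteFileIndex:
    """Toy index for illustration; not the PyIceberg class."""

    def __init__(self, path_deletes, partition_deletes):
        # data file path -> sorted list of (sequence_number, delete_file_path)
        self._by_path = defaultdict(list)
        # (spec_id, partition tuple) -> sorted list of (sequence_number, delete_file_path)
        self._by_partition = defaultdict(list)
        for data_path, seq, delete_path in path_deletes:
            self._by_path[data_path].append((seq, delete_path))
        for spec_id, partition, seq, delete_path in partition_deletes:
            self._by_partition[(spec_id, partition)].append((seq, delete_path))
        # Sort once so lookups can binary-search by sequence number.
        for entries in (*self._by_path.values(), *self._by_partition.values()):
            entries.sort()

    @staticmethod
    def _at_or_after(entries, min_seq):
        # (min_seq,) sorts before every (min_seq, path) tuple, so bisect_left
        # returns the index of the first entry with sequence_number >= min_seq.
        start = bisect_left(entries, (min_seq,))
        return [path for _, path in entries[start:]]

    def for_data_file(self, data_path, spec_id, partition, data_seq):
        """Return delete file paths that may apply to the given data file."""
        matches = self._at_or_after(self._by_path.get(data_path, []), data_seq)
        matches += self._at_or_after(self._by_partition.get((spec_id, partition), []), data_seq)
        return matches


index = SimpleDeleteFileIndex(
    path_deletes=[("a.parquet", 4, "dv-a")],
    partition_deletes=[
        (0, ("2026-01-16",), 1, "pos-1"),
        (0, ("2026-01-16",), 3, "pos-3"),
        (0, ("2026-01-17",), 9, "pos-9"),
    ],
)
print(index.for_data_file("a.parquet", 0, ("2026-01-16",), 2))  # ['dv-a', 'pos-3']
```

Each lookup touches only the two buckets that can apply to the data file, and the sequence number filter is a binary search rather than a scan. The sketch uses the same `>=` comparison as the description above and ignores any per-delete-type differences.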

This aligns with the Java implementation of DeleteFileIndex while following the existing Python infrastructure.

Are these changes tested?

New tests were added, and existing tests continue to pass.

Are there any user-facing changes?

No

Contributor

@jayceslesar left a comment

I left two nits. The existing integration tests are passing, which gives confidence, and the unit tests also look good here.

Comment on lines +1859 to +1860
def _match_deletes_to_data_file(data_entry: ManifestEntry, delete_file_index: DeleteFileIndex) -> set[DataFile]:
    """Check if delete files are relevant for the data file.
Contributor

Do we even need this function anymore?

Comment on lines +87 to +91
if lower and upper and lower == upper:
    try:
        return lower.decode("utf-8")
    except (UnicodeDecodeError, AttributeError):
        pass
Contributor

Consider using contextlib.suppress here instead of the try/except with a bare pass.
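A rough sketch of what that suggestion could look like. The enclosing function is not shown in the excerpt, so `single_path_from_bounds` and its signature are assumptions made only to keep the example self-contained.

```python
from contextlib import suppress
from typing import Optional


def single_path_from_bounds(lower: Optional[bytes], upper: Optional[bytes]) -> Optional[str]:
    # Hypothetical wrapper around the snippet under review.
    if lower and upper and lower == upper:
        # suppress() swallows the listed exceptions, replacing try/except/pass.
        with suppress(UnicodeDecodeError, AttributeError):
            return lower.decode("utf-8")
    # Reached when the bounds differ, are missing, or decoding failed.
    return None
```

Behavior is the same as the try/except version: a failed decode falls through to the code after the `with` block instead of raising.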


2 participants